UCB1 Policy for the Multi-Armed Bandit Problem
Author
Abstract
To quickly recap the setup: we have arms 1, · · · , n, and for each arm i we have an associated random variable Xi bounded in [0, 1]. Our goal is to minimize regret (equivalently, maximize reward), where regret is defined with respect to the best fixed strategy a posteriori. Every time we play arm i, the reward is drawn independently at random from the distribution of Xi. We showed in previous lectures that if we restrict ourselves to the adversarial setting, we can do at best O(√(nT log T)), so we now switch gears to stochastic inputs and aim to minimize expected regret. The highlight of the UCB1 policy is its almost "too good to be true" upper bound of O(log T). To motivate UCB1, let us first consider the following naive strategy, which achieves O(nT^{2/3} ln(nT)) regret.
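The abstract above mentions both a naive explore-then-exploit strategy and the UCB1 policy. The Python sketch below is not taken from the lecture notes; it is a minimal illustration under the standard formulation. The function names explore_then_exploit and ucb1, the per-arm exploration budget m, and the confidence radius sqrt(2 ln t / counts[i]) (the form used by Auer et al. for UCB1) are illustrative choices rather than details stated in the text.

```python
import math
import random


def explore_then_exploit(arms, T, m):
    """Naive strategy sketch: pull every arm m times, then commit to the
    empirically best arm for the remaining rounds.  `arms` is a list of
    callables, each returning an independent reward in [0, 1]."""
    n = len(arms)
    sums = [0.0] * n
    total = 0.0
    # Exploration phase: m pulls per arm to estimate each mean.
    for i in range(n):
        for _ in range(m):
            r = arms[i]()
            sums[i] += r
            total += r
    means = [s / m for s in sums]
    # Exploitation phase: play the empirically best arm for the rest.
    best = max(range(n), key=lambda i: means[i])
    for _ in range(T - n * m):
        total += arms[best]()
    return total


def ucb1(arms, T):
    """UCB1 sketch: play each arm once, then always pull the arm maximizing
    the upper confidence index  mean_i + sqrt(2 ln t / counts[i])."""
    n = len(arms)
    counts = [0] * n
    sums = [0.0] * n
    total = 0.0
    for t in range(1, T + 1):
        if t <= n:
            i = t - 1  # initialization: one pull per arm
        else:
            i = max(range(n),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        total += r
    return total


if __name__ == "__main__":
    # Two Bernoulli arms with means 0.5 and 0.6; the optimal arm is the second.
    arms = [lambda: float(random.random() < 0.5),
            lambda: float(random.random() < 0.6)]
    T = 10_000
    print("explore-then-exploit reward:", explore_then_exploit(arms, T, m=200))
    print("UCB1 reward:                ", ucb1(arms, T))
```

Because UCB1 shrinks the confidence radius of frequently played arms, suboptimal arms are pulled only O(log T) times each, which is the source of the O(log T) regret bound highlighted above.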
Similar Resources
Best Arm Identification in Multi-Armed Bandits
We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal since their regre...
UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem
ABSTRACT. In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in K-armed bandits after T trials is bounded by const · K log(T)/∆, where ∆ measures the distance between a suboptimal arm an...
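For context, the finite-time bound for the original UCB1 policy proved by Auer et al. (the reference [4] cited in the snippet above) has the following form; the constants are quoted from that paper, not from the snippet itself, so treat this as background:

$$\mathbb{E}[\text{regret after } T \text{ plays}] \;\le\; 8\sum_{i:\,\mu_i<\mu^*}\frac{\ln T}{\Delta_i} \;+\; \Big(1+\frac{\pi^2}{3}\Big)\sum_{j=1}^{K}\Delta_j,$$

where $\Delta_i = \mu^* - \mu_i$ is the gap between the mean of the best arm and that of arm $i$. The modified algorithm in the snippet is reported to improve on this bound.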
Cornering Stationary and Restless Mixing Bandits with Remix-UCB
We study the restless bandit problem where arms are associated with stationary φ-mixing processes and where rewards are therefore dependent: the question that arises from this setting is that of carefully recovering some independence by ‘ignoring’ the values of some rewards. As we shall see, the bandit problem we tackle requires us to address the exploration/exploitation/independence trade-off,...
UCB Algorithm for Exponential Distributions
We introduce in this paper a new algorithm for Multi-Armed Bandit (MAB) problems, a machine learning paradigm popular within Cognitive Network related topics (e.g., Spectrum Sensing and Allocation). We focus on the case where the rewards are exponentially distributed, which is common when dealing with Rayleigh fading channels. This strategy, named Multiplicative Upper Confidence Bound (MUCB), a...
Regional Multi-Armed Bandits
We consider a variant of the classic multiarmed bandit problem where the expected reward of each arm is a function of an unknown parameter. The arms are divided into different groups, each of which has a common parameter. Therefore, when the player selects an arm at each time slot, information of other arms in the same group is also revealed. This regional bandit model naturally bridges the non...